Bachelor Thesis Project

Evaluating the effectiveness of Twitter Sentiment Analysis as a predictive tool for the stock market.

Aim

The goal of this thesis was to build a time series representing the sentiment polarity of tweets related to a selected group of companies, and compare it to the corresponding time series of their stock market behavior.

The selected companies were: Apple, Google, Nike, Nestlé, Beyond Meat, Bayer, and NovaVax.
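As a rough illustration of what building and comparing the two time series involves, the sketch below averages per-tweet polarity scores by day and keeps only the days that also have a closing price. It is written in Python for illustration, whereas the project itself was written in R. The function names, dates, scores, and prices are all made up for the example, and averaging by day is an assumed aggregation rule, not necessarily the one used in the thesis.

```python
from collections import defaultdict
from datetime import date


def daily_mean_polarity(scored_tweets):
    """Average tweet polarity per calendar day.

    scored_tweets: iterable of (day, polarity) pairs.
    Returns a list of (day, mean_polarity) tuples sorted by day.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, polarity in scored_tweets:
        totals[day] += polarity
        counts[day] += 1
    return [(day, totals[day] / counts[day]) for day in sorted(totals)]


def align(sentiment_series, price_series):
    """Keep only the days present in both series (e.g. drop non-trading days)."""
    prices = dict(price_series)
    return [(day, s, prices[day]) for day, s in sentiment_series if day in prices]


# Hypothetical scored tweets for one company: (day, polarity)
tweets = [
    (date(2021, 3, 1), 1.0),
    (date(2021, 3, 1), -0.5),
    (date(2021, 3, 2), 0.5),
    (date(2021, 3, 6), 1.0),  # a Saturday, so there is no closing price
]
# Hypothetical closing prices: (day, close)
closes = [(date(2021, 3, 1), 100.0), (date(2021, 3, 2), 101.5)]

series = daily_mean_polarity(tweets)
aligned = align(series, closes)
for day, sentiment, close in aligned:
    print(day, sentiment, close)
```

Aligning on shared days matters because tweets arrive every day while stock prices exist only for trading days.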

Data Acquisition

I collected tweets using Twitter's API, writing all the code in R. I automated the data collection process using Windows Task Scheduler, ensuring that tweets were downloaded daily at the same time.

The downloaded data was automatically uploaded to OneDrive for remote access. Additionally, I implemented an automated email notification system via Gmail that confirmed the successful execution of the download and included basic statistics about the collected data.

Sentiment Analysis

I preprocessed the data by cleaning and lemmatizing the tweets, and then applied three sentiment analysis methods to compute a polarity score for each tweet:
  1. Naive Bayes
    Based on Bayes’ Theorem, this classifier labeled tweets as "positive" or "negative" using the MPQA Subjectivity Lexicon by Janyce Wiebe.
  2. Syuzhet
    Used the Syuzhet R package and its associated dictionary to assign sentiment scores.
  3. Udpipe
    Applied the Udpipe R package with the MPQA lexicon. This method supports intensifiers, weakeners, and modifiers, enabling it to differentiate between phrases like "good", "very good", "quite good", and "not good".
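The idea behind the third method, a lexicon lookup adjusted by nearby modifiers, can be sketched in a few lines. The example is in Python for illustration, whereas the project itself used R. The word lists, the weights, and the two-token look-back window are assumptions chosen for this example; they are not the MPQA entries or the Udpipe package's actual algorithm.

```python
import re

# Toy lexicon and modifier tables (hypothetical values, not the MPQA lexicon)
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
WEAKENERS = {"quite": 0.75, "slightly": 0.5}
NEGATORS = {"not", "never", "no"}


def polarity(text, window=2):
    """Sum lexicon scores, adjusting each by modifiers in the preceding window."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        # Look back a few tokens for words that scale or flip the polarity
        for prev in tokens[max(0, i - window):i]:
            if prev in INTENSIFIERS:
                value *= INTENSIFIERS[prev]
            elif prev in WEAKENERS:
                value *= WEAKENERS[prev]
            elif prev in NEGATORS:
                value = -value
        score += value
    return score


for phrase in ["good", "very good", "quite good", "not good"]:
    print(phrase, polarity(phrase))
```

With these toy weights the four phrases score 1.0, 1.5, 0.75, and -1.0 respectively, which is the kind of distinction a plain positive/negative word count cannot make.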

Conclusion

To evaluate whether tweet sentiment has a causal relationship with a company’s stock price, in the Granger sense that past sentiment values improve predictions of the price, I developed a custom test based on the Granger Causality Test, which I named the Close Test. The results were promising, revealing several statistically significant relationships of this kind.
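For readers unfamiliar with the underlying idea, here is a minimal sketch of a standard single-lag Granger test. It is not the Close Test itself, whose details are in the final report. It is written in Python for illustration, whereas the project itself used R. It fits two least-squares regressions, one predicting y from its own previous value and one that also includes the previous value of x, and compares their residual sums of squares with an F-statistic. The function names and the synthetic data are assumptions made for the example.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution for the coefficients
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - tail) / aug[i][i]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(x, y):
    """F-statistic for 'x Granger-causes y' using one lag."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    n = len(targets)
    # One restriction in the numerator, n - 3 residual degrees of freedom
    return (rss_r - rss_u) / (rss_u / (n - 3))


# Synthetic data in which y follows x with a one-step delay
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [random.gauss(0, 0.1)]
for t in range(1, 200):
    y.append(0.8 * x[t - 1] + random.gauss(0, 0.1))

f_xy = granger_f(x, y)  # does x help predict y?
f_yx = granger_f(y, x)  # does y help predict x?
print(f_xy, f_yx)
```

A large F-statistic in one direction and a small one in the other suggests that x carries predictive information about y but not the reverse. A real analysis would compare the statistic against the F distribution with (1, n - 3) degrees of freedom and would usually consider more than one lag.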

Interestingly, the test found that:

A more detailed analysis of the visualizations and findings can be accessed in the final report or by visiting the corresponding GitHub repository.

Tags

R, API, Statistics, Automated Tasks, Time Series Analysis, Sentiment Analysis, udpipe, Text Analysis